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Abstract. Models of weak-scale supersymmetry offer viable dark matter (DM) candidates. 
Their parameter spaces are however rather large and complex, such that pinning down the 
actual parameter values from experimental data can depend strongly on the employed sta- 
tistical framework and scanning algorithm. In frequentist parameter estimation, a central 
requirement for properly constructed confidence intervals is that they cover true parameter 
values, preferably at exactly the stated confidence level when experiments are repeated in- 
finitely many times. Since most widely- used scanning techniques are optimised for Bayesian 
statistics, one needs to assess their abilities in providing correct confidence intervals in terms 
of the statistical coverage. Here we investigate this for the Constrained Minimal Supersym- 
metric Standard Model (CMSSM) when only constrained by data from direct searches for 
dark matter. We construct confidence intervals from one-dimensional profile likelihoods and 
study the coverage by generating several pseudo-experiments for a few benchmark sets of 
pseudo-true parameters. We use nested sampling to scan the parameter space and evalu- 
ate the coverage for the benchmarks when either flat or logarithmic priors are imposed on 
gaugino and scalar mass parameters. The sampling algorithm has been used in the config- 
uration usually adopted for exploration of the Bayesian posterior. We observe both under- 
and over-coverage, which in some cases vary quite dramatically when benchmarks or priors 
are modified. We show how most of the variation can be explained as the impact of explicit 
priors as well as sampling effects, where the latter are indirectly imposed by physicality con- 
ditions. For comparison, we also evaluate the coverage for Bayesian credible intervals, and 
observe significant under-coverage in those cases. 

Keywords: dark matter theory, dark matter experiments, cosmology of theories beyond the 
SM, supersymmetry and cosmology 
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1 Introduction 

Over the past few years, our knowledge about the Universe and its basic constituents has 
been significantly enhanced. Thanks to various high-precision astrophysical and cosmological 
observations, we have been able to construct a standard and very successful theoretical model 
of our Universe (for an introduction, see Refs. [1, 2]). 

This coherent picture, however, requires introducing new ingredients to our understand- 
ing of the Universe's elementary building-blocks and their interactions. If our current stan- 
dard model of cosmology is a correct description of Nature, we need, in particular, to identify 
the nature of its two essential elements, namely dark matter (DM) and dark energy (DE), 
which together make up more than 95% of the entire energy budget of the Universe (see e.g. 



As far as the DM problem is concerned, there has recently been remarkable progress 
towards the understanding of its nature (for some reviews, see e.g. Refs. [10-13]). From 
the theoretical point of view, there is indeed no lack for number or diversity in viable DM 
candidates. Perhaps the dominant paradigm is the conjecture that DM is composed of 
weakly interacting massive particles (WIMPs). Many theories beyond the standard model 
(SM) of particle physics offer such particles as their natural ingredients. The lightest neu- 
tralino or gravitino in supersymmetric (SUSY), and the lightest Kaluza-Klein (KK) particle 
in extra-dimensional extensions are some examples. These models however still require fur- 
ther experimental constraint, or verification. 

In addition to cosmological observations, which deal with pure gravitational properties 
of DM particles, there are several other experiments that can probe additional character- 
istics of DM. These are mainly divided into three categories: direct detection (DD; where 
a WIMP is sought by its interaction with atomic nuclei of some target material), indirect 
detection (ID; where signals from WIMP self-annihilation in astrophysical sources are sought 
in gamma- or other cosmic rays), and accelerator searches (where WIMPs are expected to 
be produced through high energy processes that happen at particle colliders). Needless to 



Refs. [3-9]). 
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say, the nature of DM can be claimed as having been eventually identified within a specific 
theoretical framework only if the results of all different experiments agree with a high level 
of confidence. 

Perhaps the most popular model for DM is that of SUSY, in particular, in the framework 
of the Minimal Supersymmetric Standard Model (MSSM) when the lightest SUSY particle 
(LSP) is the lightest neutralino (for a general introduction to softly-broken weak-scale super- 
symmetry, see Refs. [14-16], and for an introduction to supersymmetric DM in particular, see 
Ref. [17]). The MSSM assumes conservation of i?-parity, making the LSP absolutely stable. 
If the neutralino (which is electrically neutral and weakly interacting) is the LSP, it can be 
considered as an interesting candidate for cold DM. 

The MSSM however contains more than a hundred free parameters which are highly 
unconstrained. There have been several attempts at comparing MSSM predictions with 
observation to pin down the favoured values of the parameters, but every such inference has 
turned out to depend strongly on the scanning technique, statistical framework and measure, 
and our prior knowledge about the employed parameterisation and the actual values of the 
parameters (see e.g. Refs. [18-21] and references therein). Even after a dramatic reduction of 
the number of model parameters, e.g. in models like minimal supergravity (mSUGRA) [22] 
or the constrained MSSM (CMSSM) [23], the parameter space is still very complex such 
that many phenomenological analyses of the model are still far from being conclusive. The 
situation will improve though when more data become available in the future, for example, 
by the Large Hadron Collider (LHC) [24, 25] or more powerful DD experiments [25, 26]. 

In this paper, we highlight yet another problem in SUSY parameter estimation and 
statistical inference: the degree to which stated confidence/ credible intervals on fitted model 
parameters do in fact contain the intended amount of probability. This is a statistical issue 
that needs to be adequately controlled in order to make robust statements about any theoret- 
ical model, in particular, SUSY models in the context of DM searches. We demonstrate the 
problem through a case study where some data from a typical DD experiment are employed 
and the analysis is performed using a state-of-the-art and widely-used scanning technique 
optimised for Bayesian inference. 

In Sec. 2, we introduce the problem with a brief review of Bayesian and classical (fre- 
quentist) statistics, and after describing our specific example in Sec. 3, we present and discuss 
our results in Sec. 4. We summarise and conclude in Sec. 5. 

2 The problem 

2.1 Bayesian and frequentist approaches to SUSY parameter estimation 

Suppose that we have a theoretical model with m free parameters = (9%, 02, m ) and also 
suppose that the model has different physical predictions corresponding to each particular 
set of parameters 6. In general, these predictions can be of two types: (1) they are physical 
observables with fixed values that can be computed theoretically, or (2) they are observables 
that are stochastic by nature (because the physical processes involved in their predictions are 
e.g. thermodynamic or quantum mechanical). In the latter case, the intrinsic stochasticity 
of the observable implies that the theoretical predictions of the model at hand are given in 
terms of probability distribution functions rather than fixed quantities. However, even in 
cases where the predicted quantity has a fixed value, there are various extrinsic sources of 
uncertainty that make a statistical analysis of the model unavoidable. These uncertainties 
can be in both predicted and measured values of the observables. Theoretical uncertainties 
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come from various approximations one makes in deriving the theoretical values from the 
underlying model, and experimental uncertainties are caused by statistical and systematic 
errors associated with imperfection of the corresponding experimental apparatus. Another 
source of uncertainty is our ignorance of potential backgrounds and this can be estimated 
either theoretically or experimentally. Let us now assume that n such quantities exist with 
experimentally measured values of D = [d\, cfe, d n ). The question then becomes: how can 
one extract useful information about the actual values of the model parameters by looking 
at the measured observables D, taking into account all sources of errors and uncertainties? 

Statisticians' answers to this question fall primarily into two categories, which have to 
do with two conceptually very different definitions of probability and how it is assigned to 
parameters and measurements: Bayesian and frequentist (classical) statistics (for a detailed 
discussion, see e.g. Ref. [27]). 

Suppose the probability of obtaining the data set D for a particular set of theoretical 
parameters O by making some measurements is given in terms of the probability density 
function (PDF) P(D\Q) (where J P(D\Q)dD = 1). This is assumed to include all possible 
contributions to our uncertainty about the measurements. From the point of view of a 
Bayesian statistician, it is perfectly rational to also speak of P(Q\D), i.e. the PDF associated 
with the probability of the particular set of model parameters being 'true' if the particular 
set of experimental data D is obtained (for an introduction to general applications of Bayesian 
statistics in physics, see e.g. Ref. [28], and for reviews of its applications in cosmology, see 
e.g. Refs. [29-32]). This is because probability for a Bayesian indicates the degree of belief 
about a statement or hypothesis. The hypothesis in the case of parameter estimation is that 
a specific set of model parameters is the true set of values. In this case j P(@\D)d@ = 1. 
The two quantities P(D\Q) and P(Q\D) are then related by applying Bayes' Theorem 

Here, P(0) represents our degree of belief about being the true values of parameters, 
'prior to' our inclusion of the information given by the data D; P(0) is simply called the 
'prior'. P(D) denotes the total probability of obtaining the particular set of data D given 
the theoretical model and irrespective of the actual values of the model's parameters; P{D) 
is called the 'evidence'. Our updated knowledge about the values of model parameters is 
all embodied in the quantity P(Q\D) on the left hand side of the equation, which is called 
the 'posterior' PDF. The ultimate objective in any Bayesian analysis of a model is then to 
accurately compute and map the posterior as a function of 0. In this case, P(D\@) on the 
right hand side (which is a PDF in the data space) is considered as a function of and is 
known as the 'likelihood function'; we denote it accordingly C(Q). 

One major issue for Bayesian parameter estimation is how to choose the prior -P(O). 
Although it reflects our prejudice about the parameter values, one may think that the 'nat- 
ural' choice for being unbiased in this respect is to work with uniform (or flat) priors for all 
parameters. This however does not help, because it depends upon the way the model is pa- 
rameterised, or equivalently upon the metric defined on the parameter space. An alternative 
way to choose the priors, which is perfectly acceptable and actually strongly recommended, 
is to use the posterior from a previous inference cycle as the prior for the next. One should 
however notice that the issue still exists for the initial choice of the priors in this inference 
chain. In cases where a large number of inference cycles are used or the data are constraining 
enough, the impact of the initial priors can become weak or negligible. In existing statistical 
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analyses of SUSY models, the effects of priors are very strong compared to data, making 
their selection an important issue. 

To understand when the effects of priors become important, we should compare the value 
of the likelihood £(G) with that of the prior at all G; the data are labelled 'constraining' if 
the likelihood dominates and 'unconstraining' if priors do. 

Contrary to Bayesian statistics, frequentism defines the probabilities in a manifestly 
'objective' way. Suppose that we have a repeatable experiment which gives a set of experi- 
mental outcomes. The probability of a certain outcome D is then defined as the fraction of 
times that D happens when the experiment is repeated an infinite number of times such that 
every time the repetition is equiprobable. Given a point G in the model parameter space, this 
probability is equal to the likelihood C(Q). Since one cannot assign objective probabilities 
to the model parameters in a similar way, the posterior PDF P(0|_D) is meaningless and the 
only well-defined quantity in this framework is the likelihood £(G). 

2.2 Credible intervals, confidence intervals and statistical coverage 

The central task of any statistical parameter estimation procedure is, in addition to giving an 
estimate of the value for a model parameter, to tell us how well the parameter is estimated, 
i.e. to provide uncertainties in the determined value. This is performed differently in Bayesian 
and frequentist frameworks by providing the so-called credible and confidence regions (or 
intervals), respectively. 

In Bayesian statistics, a direct corollary of Eq. 2.1 is that if we are interested in joint 
posterior PDFs for some parameters or the one-dimensional (ID) PDF for a single parameter, 
they can be obtained by simply integrating (or 'marginalising') over the other parameters. 
For example, the ID 'marginal' posterior PDF for the parameter 9i is 

F(9i\D) = J P{Q\D)d9 1 ...d9 l _ 1 d9 l+1 ...d9 m . (2.2) 

Suppose that we are now interested in ID intervals for a parameter 9i corresponding to a 
confidence level a. 1 This is performed in Bayesian inference by finding an interval [9j, 9f] for 
which 

«? 

F(9i\D)d9i = a. (2.3) 

1 

i 

These intervals are called 'credible intervals'. Clearly, there are an infinite number of intervals 
that can be constructed given this prescription. One can specify the interval by for example 
constructing the shortest interval, the symmetric interval around the posterior mean, or just 
the upper/lower limits on the parameter value. 

In frequentist inference, on the other hand, where no marginal posterior ¥(9i\D) exists, 
one follows the Neyman definition of 'confidence intervals' which are constructed only from 
the likelihood function £(Q) [35]. The procedure is as follows: suppose that a confidence 
interval [9j,9f] is constructed such that it corresponds to a confidence level a; 9j and 9f 
are functions of the measured experimental data D and the true parameter is believed to 
reside in the interval. The requirement for this interval to be constructed properly is that if 
the experiments are repeated ./V times and new intervals are reconstructed using the same 

1 Strictly speaking, this should be called a 'credibility level' for Bayesian statistics, but we adhere to 
'confidence level' here since the term is widely used in the literature also for Bayesian inference [27, 33, 34]. 
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procedure as the original one, these intervals contain the true value M times such that 

M , \ 

lim — = a. (2.4) 

One immediate consequence of the Neyman procedure for constructing frequentist confidence 
intervals is the requirement that Eq. 2.4 should hold for any hypothetical true values of model 
parameters. That is, if we assume that any arbitrary values are chosen by Nature, simulate 
several pseudo-experiments based on the PDF P(D\Q), and work out the confidence intervals 
corresponding to a confidence level a for the entire ensemble of experiments, we expect to 
'cover' the pseudo-true values in a fraction a of cases. This requirement for correct 'coverage' 
serves as a powerful measure of success for any statistical method in extracting frequentist 
confidence intervals (for a discussion on statistical coverage, see e.g. Ref. [33]). 

Obviously, a perfect construction of confidence intervals gives exact coverage. Approxi- 
mate methods however can give rise to both 'under'- and 'over'-coverage. Under-coverage at 
a given confidence level a means that at least for one parameter, the fraction of realisations in 
which the pseudo-true value is contained in the corresponding confidence interval is smaller 
than a. Over-coverage is defined in an analogous manner but for fractions larger than a. 

Although both under- and over-coverage are problematic, the former is much more 
severe; it shows that our estimates of the true values for model parameters cannot be trusted 
since the true parameters may be located anywhere outside the intervals. Over-covering 
intervals, on the other hand, simply provide conservative estimates of the parameters. 

One frequentist technique which provides correct coverage at a desired confidence level 
is the so-called 'confidence belt' construction proposed by Feldman & Cousins [33]. This 
technique however requires the full power of Neyman construction which is rather hard to 
implement numerically, especially for models with a large number of parameters. 

Therefore in practice, instead of a full Neyman construction of the confidence intervals, 
one employs some approximate methods to estimate the intervals in a numerically feasible 
and sufficiently accurate way. Most of these methods are based on assuming a Gaussian shape 
for the likelihood function in both data and parameter spaces, an assumption that breaks 
down for cases with complex parameter spaces and insufficient data (i.e. low statistics). It 
is important to note that for cases with high statistics, the likelihood becomes Gaussian in 
the data, where it is a PDF, and not in the parameters where it is not a PDF. Therefore 
the approximate methods for constructing the confidence intervals do not necessarily become 
exact even in the limit of high statistics. 

One of the various proposals for making the approximation, which is widely used in 
SUSY parameter extraction, is based on the 'profile likelihood'. In this method, in order to 
make sensible inference about some parameters, one maximises (or 'profiles') the likelihood 
along the unwanted parameters in the parameter space [36, and references therein]. As an 
example, the ID profile likelihood L(0j) for a single parameter 9i becomes 

L(0i) = max £(0). (2.5) 

6\,...,6i—\fii-\-\,...,6 m 

To construct the confidence intervals, one first finds the point with the highest likelihood, and 
then gives uncertainties upon the parameter values by constructing iso-likelihood contours 
in the parameter space depending on the particular confidence level and the number of 
remaining free parameters after profiling over the unwanted ones. 

It is important to remark that the inferences based on profile likelihoods and marginal 
posteriors (Eq. 2.2) do not agree in general, especially when we lack enough experimental 
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data and the parameter space is complex. This is the case for SUSY models with currently 
available data. 

In addition, as stated earlier in this section, if the likelihood has a Gaussian form in 
both data and parameters, all approximate methods for constructing confidence intervals, 
including the profile likelihood scheme, are expected to give exact coverage. This means that 
in general, one does not expect proper coverage for a complex parameter space even when 
sufficient data are available (i.e. for the case of high statistics). This is because although the 
likelihood becomes Gaussian in the data space, it might still be very different from a Gaussian 
function in the parameter space. Therefore, in general, for complex parameter spaces either 
with high or low statistics one usually expects poor coverage from the profile likelihood; it 
gives both under- and over-coverage [33]. 

It should be noted that even in cases where the profile likelihood is expected to give cor- 
rect coverage, it does so only if it is mapped accurately enough. However, most of the popular 
scanning algorithms employed for this purpose are designed and optimised for Bayesian anal- 
yses. These include Markov-Chain Monte Carlos (MCMCs) and methods based on nested 
sampling [37, 38] such as MultiNest [39, 40]. Samples based on these techniques are affected 
by the choice of priors, which in turn impact the mappings for profile likelihoods. It is there- 
fore interesting to know how good these techniques are in a frequentist framework, and one 
way of testing this is to study the degree of coverage one can achieve by employing them; 
this is the main goal of this paper. 

On the other hand, for Bayesian credible intervals, one does not expect correct coverage 
since the definition of the intervals is established upon entirely different bases. However, it is 
probably true that most physicists would calibrate even Bayesian credible intervals using a 
frequentist approach, i.e. repeated experiments [41]. Therefore, it is interesting to also look 
at the credible intervals in terms of the statistical coverage. 

3 The case study: CMSSM scans and direct detection of dark matter 
3.1 Theoretical setup and scanning strategy 

The SUSY model we analyse here is the CMSSM [23] . Here, inspired by the simplest gravity- 
mediated SUSY-breaking mechanism called minimal supergravity (mSUGRA), and the nat- 
ural connection between SUSY and grand unified theories (GUTs), all scalar, gaugino and 
trilinear mass parameters of the MSSM Lagrangian are unified at the gauge coupling unifi- 
cation scale (~10 16 GeV). These are denoted by mo, tuxm and Aq, respectively. By requiring 
electroweak symmetry breaking (EWSB) to be fulfilled at low energies, the CMSSM possesses 
only five new free parameters compared to the SM, 

m ,m 1 / 2 , Aq, tan/3, sgn/x, (3.1) 

where tan j3 is the ratio of the up-type and down-type Higgs vacuum expectation values and 
H is the Higgs/higgsino mass parameter in the MSSM superpotential. The magnitude of fi is 
fixed by the EWSB requirement and only its sign remains to be determined by experiment; 
this implies four continuous and one discrete parameters for the model. 

In this paper, we fix \i to be positive, reducing the number of parameters by one. 
We instead add four SM quantities to our set of free parameters as 'nuisance' parameters: 
the pole top quark mass (mt), the bottom quark mass evaluated at its equivalent energy 
scale (rrib(mb) MS ), and the electromagnetic and strong coupling constants both evaluated 
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at the Z-boson pole mass {cx era {fnz) M and a s (mz) MS , respectively). MS stands for the 
modified minimal subtraction renormalisation scheme, and its appearance as superscripts 
means that the corresponding quantities are computed in that scheme. These nuisance 
parameters are the SM parameters with the largest uncertainties and strongest impacts upon 
CMSSM predictions. 

The resulting eight-dimensional parameter space is used in this paper to study the 
statistical coverage. The general MSSM contains a much larger and more complex parameter 
space, but its analysis requires a huge amount of computational power. We have chosen the 
CMSSM as a testbed model to show the problem with coverage. We believe that the problem 
becomes more severe when the full MSSM is analysed. 

One important ingredient of any coverage study is scanning over model parameters 
to construct confidence intervals. This process should be performed several times. Simple 
scanning techniques, such as fixed grid or purely random scans turn out to be useless even 
for low-dimensional SUSY parameter spaces, including the CMSSM (For a recent review of 
different scanning techniques, see e.g. Ref. [20] and references therein). This means that one 
needs to work with more sophisticated scanning algorithms that are fast and efficient. 

Most existing scans are based on either MCMCs or nested sampling. These are both 
optimised for Bayesian scans (for an example of frequentist scanning techniques, see e.g. 
Ref. [20]). The two methods are shown to give very similar results, but the latter takes 
significantly less computational effort (see e.g. Ref. [19]). We therefore work with nested 
sampling throughout this paper. The particular algorithm is called MultiNest [39, 40] and is 
implemented in SuperBayeS [19, 42, 43] (available from Ref. [44]), a publicly available pack- 
age for exploring the CMSSM parameter space and comparing its predictions with different 
experimental data. SuperBayeS includes many standalone packages as submodules which cal- 
culate different quantities for a given point in the CMSSM parameter space. Packages that 
we have used in our analysis are SOFTSUSY [45] (available from Ref. [46]) for solving renor- 
malisation group equations (RGEs) from the GUT scale down to the electroweak scale where 
observable quantities such as the neutralino mass are defined, and DarkSUSY [47] (available 
from Ref. [48]) for calculating various DM DD quantities that we use in constructing our 
likelihood function; we discuss this in the next section. 

3.2 Direct detection likelihood, benchmarks and pseudo-experiments 

A central quantity in any statistical parameter scan is the likelihood function. SuperBayeS 
provides the possibility of combining likelihood contributions from experimental constraints 
on different types of observables. These include various existing collider and cosmological 
data with a potential for including DM direct and indirect detection constraints in a global-fit 
framework. The ideal strategy is then to incorporate all these data in our analyses. However, 
in order to evaluate the coverage, we need to simulate the considered experiments several 
times for each selected set of pseudo-true parameter values and then scan over the param- 
eter space based on each set of generated data. This is a lengthy process and particularly 
cumbersome when contributions from certain observables (such as the DM relic density) are 
taken into account. Again, analogous to our strategy in selecting the model, we construct 
our likelihood function such that it reflects interesting properties of the model in terms of 
our particular problem, while not being overly computationally demanding. The choice we 
make is to work only with DM DD pseudo-data. 

DD experiments aim to observe low energy nuclear recoils induced by the scattering of 
relic neutralinos off nuclei in the detector [17, 49-51]. For an experiment that observes ./V 
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events with energies Ei, we define 

C DD (Q) = P(N\(j,(Q)) J] f(E t \&) (3.2) 

i=l. .N 

as the likelihood of the experimental result, where denotes the CMSSM and SM nuisance 
parameters. Here, P(N\fi) is the probability of seeing N events for a Poisson distribution 
with average fi, while 

is the probability of an event having an observed energy E over an energy range E1—E2, and 
Nf is a normalisation factor. The quantity 4p is the expected event spectrum after accounting 
for various analysis cuts and incorporating the finite energy resolution of the detector; see 
Ref. [26] for a description of how it is calculated. In principle, background processes can 
contaminate the signal and contribute to the event spectrum 4^; however, we do not include 
any background contributions here. The expected number of events and energy spectrum 
depend on the neutralino mass m^o , the spin-independent neutralino-proton scattering cross- 
section a^ 1 , and the two spin-dependent neutralino-nucleon scattering cross-sections dp k, 
a^ D , each of which are fixed by the parameters in a given CMSSM model. 2 For the DD 
experiment and SUSY models we consider in this paper, spin-dependent interactions are not 
significant and will not be further discussed (though they are included in the calculations). 

We take for our hypothetical DD experiment a XENONlO-like detector [52] with an 
exposure of 1000 kg-days. We assume an allowed event energy range of 2-75 keV, take 
efficiencies from Ref. [53], and use an energy resolution of [54] 

cr{E) = (0.579 keV) y/E/teiV + 0.021£ . (3.4) 

The 2 keV threshold of the XENON10 analysis presented in Ref. [53] was based upon the 
understanding of the XENON10 detector behaviour at the time. However, more recent 
detector energy calibration measurements suggest the actual XENON10 energy threshold is 
higher than 2 keV (see e.g. Ref. [55] and references therein) and the energy resolution at 
low recoil energies is much more complicated than as described above [56]. However, we 
are simply using these characteristics for a hypothetical experiment and these issues are not 
relevant for the purposes of this paper. 

With this choice of the likelihood function, we then need to choose some benchmark 
points in the parameter space representing the pseudo-true parameters for our coverage study. 
These points can in principle be selected arbitrarily since the coverage requirement should 
hold for all points regardless of being well-fitted or not. The ideal situation is to evaluate the 
degree of coverage corresponding to every single point in the parameter space, but due to 
our limitation in the computational power, we just pick a couple of points for our analysis. 
We call these points benchmark 1 and benchmark 2. Table 1 shows values of all CMSSM and 
SM nuisance parameters at the two benchmarks, as well as the corresponding DD quantities 
m^o and a^ 1 . 3 



2 A fourth cross-section erf 7 , for spin-independent neutralino- neutron scattering, is nearly identical to ai^ 1 
in the CMSSM and is not treated as an independent parameter. 

3 We should remark that indeed our two benchmark points '1' and '2' have not been chosen completely 
arbitrarily. They correspond to the posterior mean and the best-fit (highest-likelihood) point, respectively, 
for a scan based on the 13 actual events and event energies observed by XENON10 [53] (albeit using our 
hypothetical experiment). 
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Oiifint'it'v 

Vc^ Lit LJL JL Li X Li y 


RpTif'H'mFiT'k" 1 

J } 11 111(11 XV JL 


RPTIpVlTTlflTk" 2 

J f 1. / II V 1 1111 (11 XV md 


mo (GeV) 


1849.67 


3886.19 


mi io fGeV) 


563.78 


1152.82 


A (GeV) 


933.13 


1529.73 


tan ,8 


54.49 


53.09 


m t (GcV) 


172.50 


172.51 


m b {m, b ) MS (GeV) 


4.21 


4.21 


a s (m z ) MS 


0.1186 


0.1180 


l/«cm(mz) MS 


127.929 


127.954 


(Pb) 


8.3 xl0~ 8 


2.3 xl0~ 7 


m^o (GeV) 


235.49 


442.74 



Table 1. Values of model and nuisance parameters at our benchmark points, as well as the corresponding 
DD quantities aS 1 and m^a. 



Parameter 


Lower limit 


Upper limit 


mo (GeV) 


60 


4000 


mi/2 (GeV) 


60 


4000 


A (GeV) 


-7000 


7000 


tan/3 


2 


65 


mt (GcV) 


163.7 


178.1 


m b {m b ) MS (GeV) 


3.92 


4.48 


« s (m z ) M 


0.1096 


0.1256 


l/a cm {m z ) MS 


127.846 


127.990 



Table 2. Ranges of model and nuisance parameters employed in our scans. 



As the next step, we should simulate some pseudo-experiments based on our DD like- 
lihood function (Eq. 3.2). The process is as follows: we begin with a benchmark point and 
fix the true values of our parameters to the corresponding benchmark values. The likelihood 
then represents probabilities for obtaining particular sets of experimental data for the as- 
sumed true values of model parameters. Based on these probabilities, one can generate some 
random values for the involved random variables (or the data points). The random variables 
in our case include the number of observed DD events and the measured recoil energies. We 
simulate 100 pseudo-experiments (with 100 random sets of generated pseudo-data) for each 
benchmark point. The data is generated by taking a random number of events according to 
a Poisson distribution with average fi; each of these events is then given a random energy 
according to the distribution given by Eq. 3.3. We then feed each set of generated data into 
our scanning machinery as a new set of experimental data and scan over the parameters. 
The scans are performed over the eight model parameters with ranges given in Table 2. 

In order to ensure that our sample points are physically self-consistent, we assign ex- 
tremely small (almost zero) values to the likelihood for the unphysical points. These points 
are the ones for which no self-consistent solutions to the RGEs exist, EWSB conditions are 
not fulfilled, some particle masses are tachyonic, or the neutralino is not the LSP. In addi- 
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Observable 


Mean value 


Standard deviation 


Reference 


rot (GeV) 


172.6 


1.4 


[57] 


m b (m b ) MS (GeV) 


4.20 


0.07 


[34] 


a s {m z ) MS 


0.1176 


0.002 


[34] 


l/a cm (m z ) MS 


127.955 


0.03 


[58] 



Table 3. Measurements of SM nuisance parameters used in the analysis. 



tion to the DD data and physicality conditions, we also apply experimental constraints on 
the measurements of the SM nuisance parameters. In SuperBayeS, the likelihoods for these 
quantities are modelled by Gaussian functions; the mean values and standard deviations we 
use in our analysis are given in Table 3. 

For each scan, we calculate ID marginal posteriors and profile likelihoods for different 
model parameters and derived observables. Finally, for every parameter or observable and 
by looking at all the 100 scans, we count the number of times the corresponding benchmark 
value falls into a certain credible or confidence interval. This gives the approximate degree 
of coverage for the corresponding level of confidence. We perform this process for both our 
benchmark points and evaluate the coverage for 68.3% (la) and 95.4% (2a) confidence levels. 

For the scans over the parameter space, we employ both flat and logarithmic (log) priors. 
A log prior for a parameter 0i (which receives only positive values) is defined as uniform in 
ln#j. The main motivation for choosing log priors is that they represent prior 'ignorance' 
instead of prior knowledge. The use of these 'non-informative' priors is justified in attempts 
for reducing the subjectivity of Bayesian interval construction, although the actual power of 
Bayesian statistics is in that it provides the possibility of including prior knowledge rather 
than ignorance [33] . The other motivation for log priors in terms of the sampling algorithm 
is that samples based on uniform priors are not distributed uniformly in a hypersphere in the 
parameter space, which one usually requires in order to correctly map a multi-dimensional 
function (such as our likelihood). Log priors are believed to alleviate this problem (for an 
example where log priors are shown to improve the statistical coverage, see Ref. [59]). 

4 Results and discussion 

4.1 One-dimensional confidence intervals and coverage evaluations 

We summarise our results in Table 4. For each benchmark point and specific choice of 
priors, we list the number of times that the corresponding benchmark values for the CMSSM 
parameters mo, mi/2> and tan/3 fall within their ID 68.3% (la) or 95.4% (2a) confidence 
intervals. We also add the two derived quantities m^o and a^ 1 to the list because those are 
the main quantities upon which the generated data depend. Since we use 100 realisations 
for each pseudo-true benchmark point, the presented numbers span a range between and 
100. For comparison, we also give the coverage for Bayesian credible intervals, although 
we do not expect proper coverage in those cases. Numbers in red and black show under- 
and over-coverage, respectively, while numbers in green indicate correct coverage (to within 
errors) . 

In order to verify whether a particular number in the table corresponds to under-, 
over- or appropriate coverage, we use binomial errors on the trial counts for the considered 
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100 


15 




17 


47 








mx/2 




92 


2 


30 


1 


17 
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99 


100 


43 


91 


91 


100 





24 


tan/3 


93 


100 


16 


91 


99 


100 


38 


99 


m x? 


57 


88 


17 


37 


2 


15 


23 


23 


a SI 
p 






22 


65 


15 


59 





1 



Table 4. Results of our coverage evaluations. Each entry shows the number of times the hypothesised true 
value of a parameter or derived quantity falls into 68.3% (1<t) or 95.4% (2a) frequentist confidence or Bayesian 
credible intervals for 100 pseudo-experiments simulated based on the likelihood. The numbers are given for 
the two benchmark points we have chosen in our analysis and for both flat and log-prior scans. Numbers 
in red and black show under- and over-coverage, respectively, while the ones in green show correct coverage. 
Green numbers are the ones that fall within [64, 72] and [94, 97] intervals for la and 2a confidence levels, 
respectively. A binomial estimate of errors has been used to determine whether the coverage is correct in each 
case. 



confidence levels 68.3% and 95.4%. The errors are calculated as the square root of the 
expression 

a 2 = np(l-p) (4.1) 

for the variance of a binomial distribution. Here n is the number of trials, which is 100 in 
our case, and the result of each trial is assumed to be 'success' (where the confidence interval 
in question contains the pseudo-true value) with probability p, and 'failure' with probability 
1 — p. In our cases p is 0.683 and 0.954, for 68.3% (la) and 95.4% (2a) confidence levels, 
respectively. This means that a count of successes for a la (2a) confidence level shows exact 
coverage if it lies in the range [64, 72] ([94, 97]); it is then coloured green in Table 4. Under- 
and over-coverage are defined in a similar way for numbers outside the corresponding ranges 
for the reconstructed la or 2a intervals, [64,72] or [94,97], respectively (coloured red and 
black in Table 4). 

We should point out that our estimation here based on a binomial process is in fact an 
estimate of the statistical error that comes from having a finite number of pseudo-experiments 
to estimate the coverage with. In a similar coverage study for the CMSSM, based on an 
ATLAS likelihood [60], errors are estimated by first working out how much realisation noise 
is produced by MultiNest and the neural networks employed for accelerating the scans in 
that study. This provides the standard deviation of the upper and lower ends of the one- 
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dimensional confidence intervals on the CMSSM parameters when doing repeated scans for 
the same set of pseudo-data. All the derived confidence intervals (i.e. from every pseudo- 
experiment) are then shifted uniformly up or down by the standard deviation on the upper 
or lower limit, and error bars assigned to coverages based on the differences induced in the 
derived coverage values. This estimate provides a measure of the random bias (i.e. systematic 
error) induced by realisation noise in the scans, due to a finite number of scans (one) for each 
pseudo-experiment . 

In principle one should estimate both errors (the statistical, as in our calculations, and 
the systematic, as in Ref. [60]), and include both in the total error budget on coverages. 
On the basis of the error bars in Ref. [60] however, it seems that the possible sampling bias 
estimated in that analysis is usually smaller than the statistical error included here. The 
statistical error we calculate should thus mostly dominate in our case, so we are well justified 
in using only the binomial error in this study. Given the larger number of pseudo-experiments 
in Ref. [60] on the other hand, sampling noise (i.e. systematics) should dominate the error 
budget in that analysis, so the binomial error can be safely neglected. 

Before we discuss the results given in Table 4, we show in Fig. 1 how the outcomes 
of a typical scan using our generated data may look. The figure shows the obtained two- 
dimensional (2D) credible and confidence regions for Bayesian marginal posterior PDFs and 
frequentist profile likelihoods, respectively, for a scan with benchmark 1 representing the 
true parameter values, and flat priors imposed on mo and mi/2- Plots are given for the four 
CMSSM parameters mo, m^, Aq and tan/3, and the two derived quantities cr^ and m^o. 
The (jpt-m^o planes are particularly interesting for DD experiments. The inner and outer 
contours in each panel (in dark and light blue for marginal posteriors and in yellow and red 
for profile likelihoods) represent 68.3% (la) and 95.4% (2a) confidence levels, respectively. 
Black dots indicate the benchmark values corresponding to benchmark 1. We also show 
in Fig. 2 the outcomes of one of the scans for which the true values are located outside 
the la and 2a credible and confidence regions. For cases where the coverage requirement 
is appropriately fulfilled, these types of plots should appear in 31.7% and 4.6% of the scans 
for la and 2a regions, respectively. For cases with significant under-coverage, these occur in 
larger fractions of times. Fig. 1 and Fig. 2 also indicate how the inferred confidence regions 
may vary in different scans with different realisations of the experiment. 

Let us now look at Table 4. The two first columns, corresponding to the confidence 
intervals for benchmark 1 are dominated by over-coverage, especially for flat-prior scans. 
The under-coverage for log priors, which happens only for m\/2 and m^o, is not severe. In 
particular for m^ 2 in this case, the numbers are very close to the exact-coverage values 
and are likely to tend to those in the large number limit (i.e. when the number of pseudo- 
experiments tends to infinity). 

For benchmark 2, on the other hand, our results show severe under-coverage; only for 
Ao and tan/3 do the intervals over-cover the true parameters. By comparing the flat- and 
log-prior cases for benchmark 2, we observe that for cases with coverage less than the desired 
value, this under-coverage deteriorates significantly from flat to log priors, in particular for 

TTli/2 an d 

In general, for mo, m^o and a^ 1 , by going from benchmark 1 to benchmark 2, 

the coverage for frequentist confidence intervals always reduces. Our results also show that 
the coverage does not change remarkably for Aq and tan f3 by changing the benchmarks or 
priors; intervals always (strongly) over-cover the parameters. 
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Figure 1. Two-dimensional marginal posteriors and profile likelihoods for one typical coverage scan over 
the CMSSM parameter space using a set of random DD data generated from the hypothetical XENON10 
likelihood. In this case, pseudo-true parameter values are set to those of benchmark 1. Plots are shown in 
mo-mi/2 and Arj-tan/3 planes, as well as for the spin-independent scattering cross-section of the neutralino 
and a proton rjf 7 versus the neutralino mass m^o. The inner and outer contours in each panel represent 
68.3% (la) and 95.4% (2a) confidence levels, respectively. Black dots show the benchmark values. 



Looking at the number of times the pseudo-true values are contained within Bayesian 
credible intervals (and we do this only for comparison), analogous patterns can be recognised: 
overall, the degree of coverage decreases when going from benchmark 1 to benchmark 2, or 
from flat priors to log priors. Exceptions are, for example, both ler and 2a intervals for m^-o, 
where the coverage for benchmark 2 improves by changing priors from flat to log, and for 
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Figure 2. As in Fig. 1, but for a scan with la and 2<r credible and confidence regions that do not contain 
the true parameters. In cases with proper coverage these should happen in about 32% and 5% of the times 
for lex and 2rr regions, respectively. For cases with significant under-coverage (as exhibited in some of our 
results), these types of plots occur much more frequently. Again, pseudo-true parameter values are set to 
those of benchmark 1. 



tan/?, where the same happens by going from benchmark 1 to benchmark 2 in both flat- and 
log-prior cases. 

Finally, a general comparison of the numbers for frequentist confidence and Bayesian 
credible intervals indicates that the former offer a substantially better coverage than the 
latter. This observation is not surprising though because, as we stated earlier, the statistical 
coverage is a property that is required for frequentist statistics and one in general does not 
expect a proper coverage for Bayesian inference. 

We think that many of the features in our results come from some 'sampling effects' 
that are caused by the explicit and implicit priors imposed on different parameters and 
observables, as well as the complex structure of the parameter space induced by various 
physicality constraints. These priors and physicality requirements affect the sampling of 
the parameter space and consequently the mapping of the likelihood function. In the next 
section, we describe these sampling effects and explain some of the patterns we see in our 
results based on them. 

4.2 Sampling effects 

Before investigating effects of new data on the statistical inference for a model, one needs to 
examine how well one's scans sample different regions of the parameter space, independent 
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of any information given by the data. For Bayesian statistics, this is obviously an important 
step because this shows what the effects of priors (and physicality conditions) are on the final 
inference, i.e. how they impact posterior PDFs when data are added. This makes it possible 
to know how the data further constrain the parameter space (if at all). 

Since no scanning algorithm is flawless, and in most cases (including our implementa- 
tion) they proceed according to the posterior PDF, these 'effective priors' (i.e. the effects from 
the explicit priors and the physicality conditions) also influence the ability of an algorithm 
to map the likelihood function and construct frequentist confidence intervals. The reason is 
that such algorithms tend to sample some regions more than the others, and for relatively 
strong prior effects this can result in poor or even no mapping of some important parts of 
the parameter space. These effects can therefore dramatically change the entire statistical 
inference. Any improper construction of confidence intervals in these cases can consequently 
induce significant deficiencies in the statistical coverage which are no more than artefacts 
of the scanning algorithm. In particular, such biases are independent of the precision of 
the actual approximate technique used for constructing the frequentist confidence intervals. 
Although in general we do not expect exact coverage from the profile likelihood method, we 
do believe that many of the features we observe in our results arise predominantly from these 
sampling artefacts. 

The ideal way to analyse such effects on the coverage is to look at the multi-dimensional 
distribution of the total sample points generated by priors and physicality conditions only, and 
then compare the algorithm's tendencies in sampling different regions, in particular regions 
around the benchmark points selected for the coverage study. This is clearly impossible to 
display for a parameter space with more than three dimensions. 

Fortunately, the particular likelihood function we have chosen for our analysis, depends 
mainly on two derived quantities, i.e. the neutralino mass m^o and its spin-independent 
scattering cross-section with nucleons a SI . We therefore begin our discussion by showing 2D 
projections of the entire set of sample points on the m^o-dp 1 plane, in Fig. 3. These samples 
are products of a scan with only physicality conditions and direct experimental constraints 
on the SM nuisance parameters. The left panel represents the distributions when flat priors 
are imposed on all parameters whereas the right panel does the same for the case with log 
priors imposed on mo and mx/2- Dots and crosses indicate locations of our benchmarks 1 
and 2 on this plane, respectively. 

In both flat and log-prior scans, Fig. 3 indicates that the effective prior on neutralino 
masses and SI scattering cross-sections is such that a large range of cross-sections is sampled 
at low neutralino masses. In contrast, at high masses a much narrower range of cross-sections 
can be seen, centred on quite low values of a^ 1 . 

Such an effective prior mostly arises from the physicality requirement that the lightest 
neutralino be the LSP. The LSP constraint means that in order to avoid a sfermion LSP, we 
require m^o < mj be satisfied for all sfermions. For large neutralino masses, this excludes 
the existence of light squarks. Because squark exchange in the s-channel is in general a major 
contributor to the amplitude of SI neutralino-quark scattering (see e.g. Ref. [17]), precluding 
small squark masses suppresses nuclear-scattering cross-sections. Because no experimental 
constraints have been imposed on these scans, there are also basically no major contributions 
from special regions such as the focus point or Higgs funnel, where this deficit could poten- 
tially be compensated for by e.g. very low Higgs masses giving rise to large contributions to 
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Figure 3. Sample points for a scan with only physicality conditions and experimental constraints upon SM 
nuisance parameters, projected on the m^o-Up 1 plane (DD constraints are not imposed here). Left and right 
panels represent scans with flat and log priors, respectively. Green dots and crosses show our benchmark 
points 1 and 2, respectively. Samples are plotted using a thinning factor of 10. 



(Tp from Higgs-mediated diagrams. 4 

Because we also impose the EWSB physicality constraint, there is also an effective 
upper bound imposed on the mass of the lightest Higgs; this means that even for very large 
m^o, only relatively low Higgs masses are probed. This means that even as m^o continues 
to increase (implying an increase in the minimum values of tuq permitted by the neutralino 
LSP constraint), the absolute contribution of Higgs exchange to a^ 1 stays roughly constant. 
When combined with the suppression of the squark-exchange contribution, this means that 
Gp 1 becomes dominated by Higgs exchange at large m^o. This produces a small band of 
permitted SI cross-sections at large m^o, which is approximately constant with increasing 
mass, and exhibits a width essentially set by the small allowed range of Higgs masses. 

One feature of our results mentioned in the previous section is that the coverage for 
m^o and a^ 1 is reduced by switching from benchmark 1 to benchmark 2, in both flat and 
log priors (Table 4). This can be understood by comparing the locations of these two points 
in the m^o-Up 1 planes of Fig. 3. The effective priors induced on these two quantities bias 
the samples towards lower values of m^o and <7p 7 . Our scanning algorithm tends to sample 
points around benchmark 1 better than around benchmark 2, as this bias is stronger further 
from the dense region of the plot. By having sampled the region around benchmark 1 more 
accurately than that around benchmark 2, we expect to find more high-likelihood points there 
with a direct impact on the constructed confidence regions, resulting in better coverage. This 
also means that such effects should be enhanced for log priors, because the concentration of 
points drops more sharply with increasing mass and cross-section in the log prior scan. This 
is seen in our results: coverage drops when changing from flat to log priors at each benchmark 
point individually, and also drops more severely at benchmark 2 than benchmark 1. 

4 Roughly speaking, large m^o together with the neutralino LSP constraint also precludes low mo, which 
tends to prevent very low Higgs masses, also suppressing the amplitude of Higgs exchange diagrams. 
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Table 5. Comparison of our coverage results with low and high statistics. Here, the true parameters are 
set to those of benchmark 2, log priors are imposed on mo and and the coverage is evaluated for la 

and 2a confidence intervals. The low-statistics case (first two data columns from the left) is the same as the 
corresponding original coverage study in Table 4, and the high-statistics case (last two data columns from 
the left) is obtained by increasing the XENON exposure by a factor of 30. The quantity /is in each case is 
the expected number of signal events, which is about 13 and 367 for low and high statistics, respectively. As 
in Table 4, numbers in red and black show under- and over-coverage, respectively, while the ones in green 
show correct coverage. The high-statistics results are based on 20 trials and the binomial estimate of errors 
implies that the green numbers in this case are the ones that fall within [58, 78] and [91, 100] intervals for la 
and 2a confidence levels, respectively. 



One potential reason for observing poor coverage in a statistical analysis of a complex 
model like the CMSSM is that the true parameters may lie at the boundary of the accessible 
parameter space (produced by imposing physically conditions) causing a poor convergence 
of the likelihood ratio test statistics (used to construct approximate confidence intervals) to 
its asymptotic chi-square distribution. In other words, Wilks' theorem does not apply in this 
case [61, 62]. This effect is independent of the sampling accuracy and can happen even in the 
limit of good sampling. To see whether this is the case for our benchmarks i.e. whether they 
lie at the boundary or not, we have used a scan with artificial priors that force the scanning 
algorithm to sample regions of the parameter space in the vicinity of the benchmarks (i.e. the 
regions with rather high masses and cross-sections) with a higher resolution. We find a large 
cloud of CMSSM points in those regions with the benchmarks well inside the cloud. This 
clearly shows that the benchmarks do not lie at the boundary and there are many physical 
points around the benchmarks that have not been found by the normal usage of MultiNest, 
i.e. when flat or log priors are used. In addition, the described boundary effects always lead 
to over-coverage (as is shown e.g. in Ref. [60]). What we see is an opposite effect, present in 
parts of the parameter space where the neutralino LSP condition indirectly leads to under- 
sampling, as the (Bayesianally-optimised) scanning algorithm preferentially explores other 
parts of the parameter space. It therefore means that the under-coverage in our results is 
caused by poor sampling rather than the boundary effects. 

Additionally, because the sampling effect is the dominant effect in our results, we expect 
that higher statistics lead to better coverage, as they lead to a reduced influence of the 
effective priors placed on the scanning algorithm by the boundary conditions. We have 
tested this by performing a coverage study for benchmark 2 using the same XENONlO-like 
likelihood but with 30 times larger exposure. This provides substantially larger statistics (for 



-17- 



Quantity 


Benchmark 3 


m (GeV) 


3834.35 


mi/j (GeV) 


378.02 


A (GeV) 


2124.04 


tan /3 


22.46 


m t (GeV) 


174.38 


miKp (GeV) 


4.10 


« s (mz) MS 


0.1140 


l/a cm (mz) A/5 


127.926 


(Pb) 


1.0 xl0~ 10 


m^o (GeV) 


159.87 



Table 6. Values of model and nuisance parameters at benchmark 3, as well as the corresponding DD 
quantities a^ 1 and m^o. 



benchmark 2, for instance, the expected number of signal events (us) becomes about 367 
as opposed to about 13 for the lower-statistics case). We use log priors for these scans and 
compare our results with the corresponding ones in Table 4 (benchmark 2, confidence intervals 
and log priors). Due to the computational limitations, we have only performed 20 scans in 
this case and evaluated the coverage based on them. This means that in this case using the 
binomial estimate of errors implies that a count of successes for a la (2a) confidence level 
shows exact coverage if it lies in the range [58, 78] ([91, 100]). The new degrees of coverage 
compared to the current ones are given in Table 5. The significant improvement in the 
coverage for mo, mi/2, m ^o and a^ 1 clearly shows that for the same set of true parameters 
and by only increasing the statistics, one gets substantially better coverage in these cases. 

It is important to realise the consequences of Fig. 3 and our above interpretation of 
coverage in terms of sampling effects in the context of future attempts to constrain SUSY 
parameters with large-scale DD experiments [26]. Since effective priors force the scanning 
algorithm to explore lower masses and cross-sections with a larger resolution, one expects 
much more accurate likelihood mappings for those regions. The coverage is thus less of an 
issue for smaller cross-sections. This is interesting because if WIMPs exist and they happen 
to have very low nuclear scattering cross-sections, even future ton-scale experiments will be 
able to detect only a few recoil events. This means that the statistics will be very low and, 
consequently, without having sufficiently high-resolution likelihood mappings of regions close 
to the actual point, any attempts at making proper statistical inference will be in vain. 

We have tested this conjecture by performing another coverage study for a benchmark 
with m^o and a^ 1 lying within the high-sample region in Fig. 3, i.e. with a low neutralino 
mass and a small cross-section. Values of the CMSSM parameters, as well as the nuisance 
parameters and the DD quantities ap 1 and m^o at this benchmark 3 are given in Table 6. 
We present our coverage results in Table 7, where they are provided for confidence intervals 
and for cases of low and high statistics. Log priors have been used in the scans and the 
results are based on 100 trials for each case. The first two data columns, correspond to 
the low-statistics analysis in which our original XENON likelihood with an exposure of 1000 
kg-days is used. The number of expected signal events us in this case becomes effectively 
zero. For the high statistics case (last two data columns), we have increased the XENON 
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Table 7. Results of our coverage evaluations for benchmark 3 (Table 6) with a low neutralino mass and a 
small cross-section located within the high-sample region of Fig. 3. Here, log priors are imposed on mo and 
m-i/2 an d the coverage is evaluated for lcr and 2cr confidence intervals. The results are given for both low and 
high statistics. The low-statistics case (first two data columns from the left) corresponds to the same XENON 
likelihood as used in Table 4, and the high-statistics case (last two data columns from the left) is obtained by 
increasing the XENON exposure by a factor of 900. The quantity us in each case is the expected number of 
signal events, which is about and 13 for low and high statistics, respectively. All numbers in this case show 
over-coverage. 



exposure by a factor of 900 so as to get approximately the same number of expected signal 
events (~ 13) as in our previous cases of benchmarks 1 and 2. This makes the comparison 
of the cases easier. Our results show over-coverage for benchmark 3 (for both low and high 
statistics) and the coverage appears to be substantially higher (for most quantities) than the 
cases for benchmarks 1 and 2. The overall significant over-coverage observed here confirms 
our interpretations about the roles of sampling deficiency in obtaining poor coverage. This is 
because, as we see here, by moving to the high-sample region of Fig. 3, the high concentration 
of sample points at low cross-sections (and low masses) provides the high-resolution mapping 
required for having a good coverage. The reason for observing higher levels of over-coverage 
in the low-statistics case of Table 7 compared to the high-statistics one, is also understood. 
The absence of any signal events in most realisations of the experiment in the former case 
(with fis = 0) means that the constraints placed on the parameters by the data are so weak 
that the constructed confidence intervals become very large and almost always contain the 
true parameters. This clearly leads to a significant over-coverage. 

Fortunately for WIMPs with high cross-sections, on the other hand, the large number 
of signal events at future detectors provides sufficiently high statistics to overcome sampling 
effects (as we discussed earlier). Prospects for future SUSY parameter estimation at DD 
experiments would thus appear quite rosy. 

In comparison to the recent complementary CMSSM coverage study [60], based on 
reconstruction of an ATLAS benchmark point with mock ATLAS results, we find rather 
different coverage properties. Here, we find that intervals undercover for both our benchmark 
points 1 and 2, whilst Ref. [60] find overcoverage for their chosen benchmark and pseudo- 
data. Regarding our above discussion, the difference can likely be explained by the different 
locations of the chosen benchmarks; their benchmark lies deep in the well-sampled, low-mass 
part of the mSUGRA/CMSSM "bulk" region. As we argued above, the high concentration 
of samples in these regions greatly reduces the influence of sampling effects. This in turn 
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allows boundary effects to dominate. The over-coverage observed in these cases is then, as 
also argued by the authors of Ref. [60], likely to originate from the physical boundaries in 
the parameter space. 

In addition to the DD quantities m^o and &„, we see variations in coverage also for 
the CMSSM parameters, mo, fn\i2i an d tan/3, when we change our benchmark or priors 
(Table 4) . Most of the patterns we observe here can be explained by analysing 2D plots of the 
parameters versus the observables m^o and Cp . The sampling effects we discussed in terms of 
the DD quantities can propagate throughout the entire parameter space, impacting statistical 
inferences for other quantities like the CMSSM parameters themselves. To demonstrate these 
effects, let us therefore look at Figs. 4 and 5, where samples for the scan with only priors, 
physicality and nuisances are given in planes of parameters versus m^o and a^ 1 . Again, left 
and right panels correspond to flat and log priors, respectively, and our two benchmarks (1 
and 2) are shown by dots and crosses. 

The gaugino mass parameter m 1/ / 2 is perhaps the most obvious case. Table 4 shows 
that the coverage in this case reduces by changing the benchmark from 1 to 2, and this 
reduction is significantly greater when log priors are imposed. Comparison of mi/2 values 
with neutralino masses m^o (Figs. 4c, d) indicates a correlation between the two, as expected 
in the CMSSM. This is not surprising though when we recall the actual correlation between 
m^o and in the CMSSM. The thin straight line depicts the strong correlation between 
the two quantities, and this means that we expect to see very similar behaviours for the two 
cases in any statistical inference; these include variation patterns for the coverage. As an 
alternative way of explaining our observations for mi/2, we could instead look directly at 
Figs. 5c, d for mi/2 versus <Jp . Since the plots are very similar to the ones in Fig. 3 for m^o 
and <Jp , our previous arguments hold here as well. 

In order to understand our results in Table 4 for the scalar mass parameter mo, consider 
Figs. 5a,b for mo versus dp 1 in both flat and log priors. Let us first look at Fig. 5b for log 
priors. The bias in the distribution of the sample points in the figure toward lower mo and 
Gp 1 shows that the scanning algorithm tends to sample regions with lower values of these 
quantities. There are more sample points around benchmark 1 in this plane, meaning that the 
likelihood is scanned more accurately in the vicinity of benchmark 1 compared to benchmark 
2. This observation explains the large decrease in the coverage for the latter case. Fig. 5a 
for flat priors, shows similar patterns although they are not as strong as in Fig. 5b. This 
is completely consistent with our numbers for mo in Table 4 when flat priors are imposed: 
coverage does drop from benchmark 1 to 2, but less substantially than in the log prior case. 

Finally, for the remaining two CMSSM parameters, Aq and tan j3, Table 4 does not 
show large variations in coverage in different cases. These can again be explained from what 

we observe in Figs. 4e,f and Figs. 4g,h for Aq and tan/3 versus m^o in both flat and log 

1 

priors. Plots show that our benchmarks 1 and 2 are not considerably far from each other in 
these planes, and are hence sampled approximately equally well. Effects on the likelihood 
reconstruction of moving from one point to the other are therefore not large, and result in 
only slight changes to the coverage. We also do not see any major correlations between these 
parameters and er^ in Figs. 5e,f and Figs. 5g,h, confirming this interpretation. 

As we stated several times in the above discussions, our interpretation of the poor 
coverage in our results is that the issue mainly stems from the sampling effects that are 
induced by the imperfection of the employed scanning algorithm in accurately reconstructing 
the profile likelihoods rather than the breakdown of the profile likelihood approximation of 
the frequentist confidence intervals. This means that by using an algorithm that is optimised 
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Figure 4. As in Fig. 3, but for the CMSSM parameters mo, ^1/2? Aq and tan/5 versus the neutralino mass 
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Figure 5. As in Fig. 3 and Fig. 4, but for the CMSSM parameters mo, mi/2, -Arj and tan/3 versus the 
spin-independent scattering cross-section of the neutralino and a proton a^ 1 . 



-22- 





Benchmark 2 (conf. int.) 


1(7 


Z<7 


Flat + log priors 


to 


35 


72 


mi/2 


15 


55 


A 


92 


100 


tan (3 


100 


100 


A l 


23 


60 


V 


21 


63 



Table 8. As in Table 4, but when the sample points of the scans with flat and log priors are combined and 
the confidence intervals are constructed based on the total points. The table shows the combined analysis for 
benchmark 2 and confidence intervals only. Again, numbers in red and black show under- and over-coverage, 
respectively. 



for frequentist analysis of the parameter spaces, it should be possible to substantially alleviate 
the coverage issue in these cases. 

If one wants to adhere to the Bayesian techniques such as the one we have employed 
here, one potential solution would be to first explore the parameter space using different types 
of priors and then construct the confidence intervals from the combination of all sample 
points that are obtained this way. This is because each set of priors leads the algorithm 
to scan different parts of the parameter space with different resolutions and consequently 
the combined sample points would provide a better mapping of the likelihood over the entire 
parameter space. We investigate this for the particular cases of the flat and log priors utilised 
in our scans. The procedure is the following: For every flat-prior scan we have performed, 
we randomly choose one of our 100 log-prior scans and then merge the two sample sets into 
a new one. We repeat this for all 100 flat-prior scans and in the end we obtain 100 new 
sets of sample points. We then construct the one-dimensional profile likelihoods for each set 
and study the degrees of coverage in exactly the same way as before. Our coverage results 
for benchmark 2 and confidence intervals are given in Table 8. The numbers clearly show 
that combining the two sets of sample points does not improve the coverage in this case 
(compare with Table 4). For parameters too, rn 1 / 2 , to^o and a^ 1 , where both flat- and log- 
prior cases show under-coverage in Table 4, we again observe severe under-coverage in the 
combined case. While the coverage improves compared to the case with log priors only, it 
deteriorates compared to the flat-prior case. We discuss a possible reason for this somewhat 
counter-intuitive result at the end of this section. For A$ and tan/3 on the other hand, 
where both flat and log priors of Table 4 show similar over-coverage, the combined results 
also show over-coverage with numbers very similar to the individual cases. Although our 
results indicate that here flat-prior scans provide better coverage than the log and (flat+log) 
scans, they suggest that for cases where flat and log priors give over- and under-coverage, 
respectively, their combination could result in a proper coverage. 

Another way to improve the coverage when Bayesian scanning algorithms are used is to 
modify the configuration of the algorithms (if possible) so as to make them more optimised 
for frequentist analyses. In the coverage evaluations of this paper, we use MultiNest with 
a configuration that is widely used in SUSY parameter estimation (see e.g. Ref. [19]): the 
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number of live points, the tolerance parameter and the maximum efficiency are set to 4000, 
0.5 and 1, respectively. This corresponds to around 50000 points in the final set of sam- 
ples for each scan. After the initial submission of this paper however, a new configuration 
of MultiNest was suggested [63] that maps profile likelihoods from present-day data much more 
accurately. Here a smaller tolerance parameter (0.0004) and a significantly higher number of 
live points (20000) are used, and the algorithm requires substantially longer run-time. We 
have investigated in a few individual parameter scans whether running MultiNest with this 
new configuration leads to substantially different intervals for the profile likelihoods than the 
ones we have obtained in the present work. The results are however not conclusive. In some 
cases, we obtain larger contours for the confidence regions, suggesting that running with the 
new configuration might increase the coverage. It is however possible that the changes we 
observe are solely due to sampling noise, rather than any real improvement in the coverage. 

Although improving the mapping of the profile likelihood should typically result in im- 
proved coverage in situations where under- or over-coverage can be attributed to deficiencies 
in scanning techniques, this will also not necessarily be borne out in the results of every 
individual parameter scan. This is because there are two competing effects in mapping the 
likelihoods with MultiNest: (1) The scan with the new configuration might produce larger 
contours because it finds more points which lie above the threshold likelihood. (2) It might 
produce smaller contours, because it should find a higher best-fit point, shifting the thresh- 
olds for 1- and 2a up. This may also explain the curious result discussed above that merged 
(flat+log) prior scans sometimes show poorer coverage than simple flat prior scans. In order 
to conclusively determine whether changing the MultiNest parameters improves the coverage 
or not, one needs to perform a complete coverage study using the new configuration, which 
is well beyond the scope of the current paper given the massive computational resources 
required for such an investigation. 

5 Conclusions 

We have investigated the ability of advanced Bayesian scanning techniques to correctly 
construct frequentist confidence intervals for supersymmetric parameter estimation. Our 
construction has been based on one-dimensional profile likelihoods obtained by maximising 
the full multi-dimensional likelihood function over all unwanted parameters. We chose the 
CMSSM for our analysis, giving us four free continuous parameters. We added to this pa- 
rameter space four SM quantities as nuisance parameters. We assessed correctness of the 
intervals by evaluating the statistical coverage for two benchmark points in the parameter 
space when either flat or logarithmic priors are imposed on the GUT-scale unified mass pa- 
rameters rriQ and rrii/2- We used a likelihood for direct detection of dark matter particles with 
a XENONlO-like experiment. This gave us a rather high-statistics likelihood. We explored 
the parameter space with MultiNest, a state-of-the-art scanning technique based on nested 
sampling and we set the configuration parameters of the algorithm to the commonly used 
values optimised for Bayesian inference. 

Our results show poor coverage for most cases. We see both over- and under-coverage 
that vary for different benchmarks and priors. We studied effects of sampling biases caused 
by the explicit priors as well as the physicality constraints imposed on different regions of 
the parameter space, and showed that these effects can be very strong in some cases. This 
causes the sampling algorithm to explore different regions with different resolutions, which 
consequently impacts its ability to accurately map the likelihood function. Some regions 
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with direct impact on the statistical inference and construction of the confidence intervals 
can be entirely missed. Our interpretations of the observed poor coverage (especially in 
under-coverage cases) were mainly based on these sampling effects, i.e. the inadequacy of 
the Bayesianally-optimised scanning algorithms in constructing profile likelihoods. Although 
poor coverage can arise alternatively from the breakdown of profile likelihood approximation 
of the full Neyman construction of confidence intervals, we did not consider this as the 
dominate effect in our analysis. This effect usually happens when the true parameters lie at 
the boundary of the accessible parameter space which for SUSY models is set by physicality 
constraints. Our benchmarks however did not seem to be located at the boundary. In 
addition, the boundary effects always lead to over-coverage, contrary to the fact that our 
results show also substantial under-coverage in various cases. 

The sampling effects are expected to significantly weaken in situations where very high- 
statistics data are available. This could happen with, for example, collider data from the 
LHC or DM DD data from the future ton-scale experiments. 

We have seen that imposing log priors does not help improve coverage, and that re- 
sults even deteriorate in most cases. In addition, the confidence intervals constructed from 
the combination of flat- and log-prior samples do not show proper coverage. This means 
that if one wants to use the scanning techniques that are optimised for Bayesian inference 
to correctly estimate parameters of a complex model such as the CMSSM in a frequentist 
framework, other more sophisticated types of priors should be employed. Some Bayesian 
techniques, including MultiNest, can be configured in such a way that become more appro- 
priate for profile likelihood analyses. Utilising these different configurations should provide a 
substantially better coverage given that they map the likelihood accurately enough and the 
poor coverage is predominantly due to the sampling effects. It is however entirely possible 
for a complex parameter space that the profile likelihood does not approximate the full Ney- 
man construction accurately enough, and consequently, the confidence intervals constructed 
based on it do not cover the true parameters properly. In these cases, one can replace the 
profile likelihood method with other more sophisticated construction methods, such as the 
one suggested by Feldman & Cousins [33]. 

Given that both MCMCs and nested sampling are optimised for calculating the Bayesian 
posterior PDF and not the profile likelihood, there is reason to suspect that the MCMC 
techniques employed in other frequentist scans of the CMSSM (e.g. Refs. [64-66]) might also 
show similarly poor coverage. An investigation of the coverage properties of these techniques 
has yet to be presented, and would be a very interesting addition to the literature. 

Our analysis has been limited to a very specific example of SUSY parameter estimation. 
The model is reduced to the CMSSM, the likelihood is restricted to a particular DD experi- 
ment and uncertainties in the halo model and neutralino-quark couplings are ignored (see e.g. 
Refs. [26, 67] for the impacts of these uncertainties on the CMSSM parameter estimation). 
We made these choices particularly due to our limited computational power for generating 
the required pseudo-data. Ideally one should take into account likelihoods from all available 
astrophysical and collider experiments in a global-fit framework, including as many nuisance 
parameters as possible. Studying the coverage in these cases is of great importance for any 
frequentist inference on the model. The number of pseudo-experiments we used here was 
sufficient to demonstrate that there are issues with coverage for our particular chosen case. 
This number could be increased in the future to obtain more accurate estimates of the cov- 
erage of such global fits. This however requires substantial computing resources and more 
efficient scanning techniques (for a coverage study of the CMSSM with ATLAS likelihoods 
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and accelerated scans based on neural networks, see Ref. [60]); we leave the investigation of 
such cases for future work. 
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